Conference where mixing is time controlled by a rendering device

ABSTRACT

A telecommunications terminal hosts a conference mixer adapted to enable an at least audio conference between a first conference peer and at least two further conference peers. The conference mixer includes for each of the at least two further conference peers, a respective first data buffer configured to buffer portions of at least an audio data stream received from the respective conference peer; a first audio data stream portions mixer fed by the first data buffers and configured to: a) get audio data stream portions buffered in the first data buffers; b) mix the audio data stream portions from the first data buffers to produce a first mixed audio data portion; and c) feed the first mixed audio data portion to a rendering device of the telecommunications terminal, wherein the first audio data stream portions mixer is configured to perform operations a), b) and c) upon receipt of a notification from the rendering device indicating that the rendering device is ready to render a new mixed audio data portion.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to the field of telecommunications, and particularly to audio or audio/video conferencing. Specifically, the invention concerns a telecommunications terminal hosting an audio or audio/video conference mixer.

2. Description of the Related Art

In the field of telecommunications, the diffusion of Voice over Internet Protocol (VoIP) services, and of devices supporting them, is growing rapidly. A similar rapid growth is experienced by video communication (VDC) services and supporting devices.

Most often, services of this kind involve two intercommunicating peers, but an interesting extension is represented by “virtual” audio and/or video conferencing, where more than two parties (“peers”) are involved in the audio and/or video communication session, and can interact with each other by listening/speaking and/or viewing.

Apparatuses that enable virtual audio and/or video conferences are known as “conference mixers”. Essentially, a conference mixer gathers the audio and/or video contents generated by local capturing devices (microphones, videocameras) provided in user terminals (the “endpoints”) at each of the conferencing parties, properly mixes the gathered audio and/or video contents, and redistributes the mixed contents to every party to the virtual conference.

Conventionally, conference mixers are apparatuses distinct and remote from the endpoints of the conferencing parties, being core network apparatuses (referred to as “Master Control Units”, MCUs for short).

Solutions for incorporating conference mixing functions in the endpoints of the conference peers are known in the art.

For example, the published patent application US 2003/0142662 discloses a packet data terminal, particularly a personal computer, personal digital assistant, telephone, mobile radiotelephone, network access device, Internet peripheral and the like, which initiates, coordinates and controls the provision of on-demand conference call services, with little or no network support. In that terminal, a digital-to-analog converter converts first and second packet data streams into separate analog representations; a selective mixer manipulates the analog representations to provide a mixed output; and a multiplexer circuit distributes the packet data stream to a plurality of call sessions.

SUMMARY OF THE INVENTION

On the one hand, the Applicant observes that the implementation of virtual audio or audio/video conference services based on the provision of dedicated core network equipment (the MCUs) is not satisfactory, mainly because it impacts the telephone/telecommunications network structure, and involves costs for the network operators. Thus, the Applicant believes that a different implementation of virtual conference services, in which an audio or audio/video conference mixing functionality is hosted at the endpoint of at least one of the peers engaged in the virtual audio or audio/video conference, is better, because it has essentially no impact on the telephone/telecommunications network.

Nevertheless, the Applicant has observed that an important aspect that remains to be carefully considered is the reduction, as far as possible, of the end-to-end delay which is experienced by the peers engaged in a virtual conference.

The Applicant has tackled the problem of how to reduce the end-to-end delay in virtual audio or audio/video conference services to be enjoyed through a conference mixer hosted in an endpoint of one of the conference peers.

The Applicant has found that the end-to-end delay experienced in virtual audio or audio/video conferences can be reduced, provided that the mixing operations are timed by the rendering device(s) and/or the capturing device(s) of the endpoint hosting the conference mixer.

According to an aspect of the present invention, a telecommunications terminal is provided, hosting a conference mixer adapted to enable an at least audio conference between a first conference peer and at least two further conference peers. The conference mixer comprises:

-   for each of the at least two further conference peers, a respective first data buffer configured to buffer portions of at least an audio data stream received from the respective conference peer;
-   a first audio data stream portions mixer fed by the first data buffers and configured to:
    -   a) get audio data stream portions buffered in the first data buffers;
    -   b) mix the audio data stream portions obtained from the first data buffers to produce a first mixed audio data portion; and
    -   c) feed the first mixed audio data portion to a rendering device of the telecommunications terminal,

wherein said first audio data stream portions mixer is configured to perform operations a), b) and c) upon receipt of a notification from said rendering device indicating that the rendering device is ready to render a new mixed audio data portion.

For the purposes of the present invention, by “audio conference” there is meant a virtual conference between three or more peers, including at least audio. Possibly, the audio conference could also include video, i.e. it could be an audio/video virtual conference.

According to another aspect of the present invention, a method is provided of performing an at least audio conference between a first conference peer and at least two further conference peers, the method comprising:

-   at a telecommunications terminal of the first conference peer, performing a first buffering of portions of at least an audio data stream received from each of the at least two further conference peers; and
-   upon receipt of a notification from a rendering device of the telecommunications terminal, indicating that the rendering device is ready to render a new mixed audio data portion:
    -   mixing the audio data stream portions buffered in said first buffering, to produce the mixed audio data portion; and
-   feeding the mixed audio data portion to the rendering device.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the present invention will be best understood by reading the following detailed description of an exemplary embodiment thereof, provided merely by way of non-limitative example, which description will be conducted with reference to the annexed drawings, wherein:

FIG. 1 schematically shows a scenario where the present invention is applied;

FIG. 2 schematically shows, in terms of functional blocks, the structure of an audio conference mixer according to an embodiment of the present invention;

FIG. 3 schematically shows the main functional components of a communication terminal adapted to host the audio conference mixer of FIG. 2;

FIG. 4 shows in greater detail the structure of the audio conference mixer according to an embodiment of the present invention;

FIG. 5 depicts, in terms of a schematic flowchart, the operation of the audio conference mixer of FIG. 4, in an embodiment of the present invention;

FIG. 6 is a time diagram illustrating the timing of a process of replenishment of an audio data stream fragment buffer of the audio conference mixer of FIG. 2, in an embodiment of the present invention; and

FIG. 7 is a time diagram of an exemplary case in which chunks of audio data streams arriving at the endpoint that hosts the audio conference mixer from two peers engaged in a virtual conference are made available to the audio conference mixer at different rates.

DETAILED DESCRIPTION OF AN EMBODIMENT OF THE INVENTION

Referring to the drawings, in FIG. 1 a scenario where the present invention is applicable is schematically shown. Reference numerals 105 a, 105 b and 105 c denote three persons engaged in a virtual audio conference, exploiting respective communication terminals 110 a, 110 b and 110 c, like for example video-telephones, Personal Digital Assistants (PDAs), mobile phones, personal computers, interconnected through a telecommunication network 115 that may include a wireline and/or wireless telephone network, and a packet data network like the Internet.

The three persons (peers) 105 a, 105 b and 105 c involved in the virtual audio conference can each talk to, and be seen by, the other two peers, and each of the three peers can listen to and see the other two peers. This is made possible by a conference mixer, the main functionality of which is to provide a “virtual conferencing” user experience.

It is pointed out that the choice of considering just three conferencing peers is merely dictated by reasons of description simplicity: the present invention is not so limited, and is applicable to any number of conferencing peers.

Referring to FIG. 2, the main components of an audio conference mixer according to an embodiment of the present invention are schematically shown.

The conference mixer, denoted 205, is hosted by one of the terminals (endpoints) of the three peers engaged in the virtual audio conference, in the shown example at the endpoint 110 c of the peer 105 c (the “peer C”). It is sufficient that the endpoint of at least one of the peers involved in the virtual conference hosts the conference mixer (in general, the conference mixer may be hosted by the endpoint of the peer that initiates the virtual conference); however, nothing prevents a conference mixer from also being hosted by one or more of the endpoints of the other peers.

The conference mixer 205 comprises a first mixer 220 configured to receive audio data streams 225 a and 225 c, respectively received from the endpoint 110 a of the peer 105 a (the “peer A”) and generated by the endpoint 110 c of the peer C, and to generate a mixed audio data stream 225 ac to be sent to the endpoint 110 b of the peer 105 b (the “peer B”). The audio data streams 225 a and 225 c are generated by capturing devices like microphones 215 a and 215 c of the endpoints 110 a and 110 c; at the endpoint 110 b, the mixed audio data stream 225 ac is rendered by a rendering device like a loudspeaker 210 b.

The conference mixer further comprises a second mixer 230, configured to receive the audio data stream 225 c, generated by the endpoint 110 c of the peer C, and an audio data stream 225 b received from the endpoint 110 b of the peer B, and to generate a mixed audio data stream 225 bc to be sent to the endpoint 110 a of the peer A. The audio data stream 225 b is generated by a microphone 215 b of the endpoint 110 b; at the endpoint 110 a, the mixed audio data stream 225 bc is rendered by a loudspeaker 210 a.

The conference mixer further comprises a third mixer 235, configured to receive the audio data stream 225 a, received from the endpoint 110 a of the peer A, and the audio data stream 225 b, received from the endpoint 110 b of the peer B, and to generate a mixed audio data stream 225 ab, rendered by a loudspeaker 210 c of the endpoint 110 c.

In particular, the audio data streams are in digital format, and the mixers are digital mixers.

It is pointed out that the number of data stream mixers of the conference mixer 205, as well as the number of data streams that each of the mixers is configured to receive and mix, depend on the number of peers engaged in the virtual conference.

It is also worth pointing out that the operation performed by the conference mixer 205 differs from the operation performed by an MCU provided in the core network. The conference mixer 205 integrates the operations of grabbing, rendering and mixing in a single device (the device 110 c), while an MCU in the core network performs all the operations relevant to a rendering device in order to have access to the audio samples to be mixed, mixes them and transmits them (possibly compressing them) to the appropriate peer.

Referring to FIG. 3, there is schematically depicted the hardware structure of the endpoint 110 c that hosts the conference mixer 205. Essentially, it is the general structure of a data processing apparatus, with several units that are connected in parallel to an internal data communication bus 305. In detail, a data processor (microprocessor or microcontroller) 310 controls the operation of the terminal 110 c; a RAM (Random Access Memory) 315 is directly used as a working memory by the data processor 310, and a ROM (Read Only Memory) 320 stores the microcode (firmware) to be executed by the data processor 310. A communication subsystem 325 includes hardware devices for handling at least the physical level of the communications over the telephone/telecommunications network 115; a keyboard 330 is provided for dialing the telephone numbers; an audio/video subsystem 335 manages the loudspeaker/display device 210 c and the microphone/videocamera 215 c.

Passing to FIG. 4, the functional components of the conference mixer 205 are shown in greater detail. In particular, FIG. 4 shows the partial content of the working memory 315 of the terminal 110 c during a virtual conference between the peers A, B and C; thus, the functional blocks depicted in FIG. 4 are to be understood as software/firmware modules, or instances of software/firmware modules. This is however not to be construed as a limitation of the present invention, which might be implemented totally in hardware, or as a combination of hardware and software/firmware.

Blocks 405 a and 405 b represent instances of a grabbing multimedia software module, adapted to perform the tasks of grabbing the mixed audio data streams 225 bc′ and, respectively, 225 ac′, coding them to obtain coded (i.e. compressed) data streams 225 bc and, respectively, 225 ac, and transmitting them over the telephone/telecommunications network 115 to the endpoints 110 a and, respectively, 110 b of the peers A and B.

In a preferred embodiment of the present invention, the mixing operation of the audio data streams that generates the mixed audio data streams 225 bc and 225 ac is performed in the uncompressed domain (on Pulse Code Modulated values); this avoids the problem of compatibility between different compression algorithms (for example G.723, AMR, G.722) that may have been negotiated between the different peers (in other words, peers A and C might have negotiated a compression algorithm different from that negotiated between the peers B and C). In this case, the grabbing multimedia software module instances 405 a and 405 b are also responsible for encoding the mixed audio data streams 225 bc and 225 ac in a respective, predetermined coding standard, that may be different for the different peers A and B of the virtual conference.
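Purely by way of illustration, mixing in the uncompressed domain might be sketched as follows. The description does not specify the mixing arithmetic, so the sample-wise sum and the saturation to the 16-bit signed range are assumptions, and the name mix_pcm16 is hypothetical.

```python
def mix_pcm16(chunk_a, chunk_b):
    """Mix two equal-length chunks of 16-bit PCM samples, sample by sample.

    Assumption: mixing is a plain sum, saturated to the 16-bit signed range
    so that loud passages clip instead of wrapping around.
    """
    if len(chunk_a) != len(chunk_b):
        raise ValueError("chunks must contain the same number of samples")
    mixed = []
    for a, b in zip(chunk_a, chunk_b):
        total = a + b
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

Because the operands are plain PCM values, the same routine would apply regardless of which compression algorithm each peer has negotiated; encoding happens afterwards, per peer.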

Blocks 410 a and 410 b represent instances of a rendering multimedia software module adapted to perform the tasks (independent from the grabbing tasks performed by blocks 405 a and 405 b) of receiving the audio data streams 225 a and 225 b, respectively, transmitted by the endpoints 110 a and, respectively, 110 b over the telephone/telecommunications network 115, decoding the received audio data streams 225 a and 225 b to obtain decoded data streams 225 a′ and 225 b′, and rendering them through the loudspeaker 210 c of the endpoint 110 c.

The grabbing and rendering multimedia software module instances 405 a, 405 b, 410 a and 410 b are user-space applications, running at the application layer level in the endpoint 110 c. As known to those skilled in the art, the working memory of a data processor (like the RAM 315 of the terminal 110 c) can ideally be divided into two basic memory spaces: a “user space” and a “kernel space”. The user space is the memory region where the user software/firmware applications or executables reside and run. The kernel space is the memory region where the kernel software/firmware modules or executables reside; kernel software/firmware modules are software/firmware modules forming the core of an operating system which is started at the bootstrap of the data processor, and whose function is to provide an environment in which other programs can run, provide hardware services to them (like supplying memory and access to space on storage devices), schedule their execution and allow multiple processes to coexist.

Generally, in order to access audio capturing and playing (grabbing and rendering) resources (the microphone 215 c, the loudspeaker 210 c), the grabbing and rendering multimedia software module instances 405 a, 405 b, 410 a and 410 b exploit dedicated library modules, typically user-space library modules.

As known to those skilled in the art, there are two possible approaches in accessing and using devices like microphones and loudspeakers: using library modules running in user space, or directly using kernel-space device driver Application Program Interfaces (APIs). Library modules use device driver APIs to control the device of interest. Kernel-space device drivers are used when the operating system prevents user-space applications from directly accessing the hardware resources; normally this is related to the presence of hierarchical protection domains (or protection rings) in the operating system, acting as a protection method against application-generated faults. Device drivers can run in user space (and thus act as user-space library modules) when the operating system does not implement the protection rings concept.

In an embodiment of the present invention, the conference mixer 205 is implemented as a user-space process or thread, preferably of high priority, running in the endpoint 110 c; in this case, the conference mixer 205 can replace the library modules used by the instances 405 a, 405 b, 410 a and 410 b of the user-space grabbing and rendering multimedia modules (application layer) to access the capturing and rendering audio resources 210 c and 215 c.

Alternatively, the conference mixer 205 might be implemented as a kernel-space device driver. In this case, the conference mixer 205 replaces the kernel device driver normally responsible for handling the audio grabbing and rendering operations.

Implementing the conference mixer functionality as a kernel-space device driver allows better exploitation of the low-latency benefit of treating data within an interrupt service routine, and avoids burdening the system with high priority user-space threads/processes.

In FIG. 4, line 415 indicates an API exposed by the conference mixer 205, through which it can be accessed by the grabbing and rendering multimedia software module instances 405 a, 405 b, 410 a and 410 b. The API 415 replicates the same functionalities provided by the library modules used to access the I/O resources 210 c and 215 c. The behavior of the multimedia rendering and grabbing module instances 405 a, 405 b, 410 a and 410 b does not need to be changed in order to allow them to interact with the conference mixer 205 (whose presence is thus transparent to the multimedia rendering and grabbing module instances 405 a, 405 b, 410 a and 410 b).

The conference mixer 205 comprises (in the example herein considered of a virtual conference involving three peers) three audio data stream chunk mixers 420, 425 and 430, adapted to mix portions (or chunks) of the audio data streams. Two data buffers are operatively associated with each of the mixers 420, 425 and 430: the buffers 420-1 and 420-2, 425-1 and 425-2, and 430-1 and 430-2, respectively. Data buffers 420-1 and 425-1 are the data contribution buffers provided in respect of the audio data streams coming from the peers A and, respectively, B, for grabbing purposes. Data buffers 420-2 and 425-2 are the data recipient buffers for the mixed audio data streams to be sent to the peers B and, respectively, A. Data buffers 430-1 and 430-2 are the mixer contribution buffers for the audio data streams received from the peers A and B, respectively, for rendering purposes.

In particular, the mixers 420, 425 and 430 are digital mixers.

In case more than three peers are engaged in the virtual conference, the number of mixers and associated buffers increases; in particular, for each additional peer participating in the virtual conference, a mixer like the mixer 420, with an associated pair of buffers like the buffers 420-1 and 420-2, needs to be added; also, a buffer like the buffer 430-1 or 430-2 has to be added for each additional peer.
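The way the set of mixers grows with the number of peers can be illustrated with the following minimal sketch. It only enumerates which streams each mixer would combine; the function name plan_mixers and the labels it produces are hypothetical and are not part of the described embodiment.

```python
def plan_mixers(host, remote_peers):
    """List the input streams of each mixer hosted at `host`.

    One outgoing mixer per remote peer combines the host's captured audio
    with the audio of every *other* remote peer; one rendering mixer
    combines the audio of all remote peers for the host's loudspeaker.
    """
    plan = {}
    for target in remote_peers:
        others = [peer for peer in remote_peers if peer != target]
        plan["to_" + target] = others + [host]
    plan["render_at_" + host] = list(remote_peers)
    return plan


three_peer_plan = plan_mixers("C", ["A", "B"])
```

For the three-peer example, three_peer_plan contains three entries, matching the mixers 425 (stream to peer A), 420 (stream to peer B) and 430 (rendering at peer C); adding a fourth peer adds one entry.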

Reference numeral 435 denotes an audio rendering procedure, for sending audio data stream chunks that are ready to be rendered to the loudspeaker 210 c. Reference numeral 440 denotes an audio capturing procedure, for receiving audio data captured by the microphone 215 c.

In case the conference mixer 205 is implemented as a user-space process, the mixing operations performed by the mixers 420, 425 and 430 may be implemented as high priority threads/processes. By high priority threads/processes there are meant threads/processes running at a priority higher than the normal tasks. For the purposes of the present invention, high priority is intended to mean the highest possible priority (closest to real-time priority) that does not jeopardize the system stability. Alternatively, if the conference mixer 205 is a kernel-space device driver, the mixing operations performed by the mixers 420, 425 and 430 may be implemented as interrupt service routines, that are started as soon as an interrupt is received from the rendering and capturing devices.

In particular, the mixing operations are performed at the Input/Output (I/O) rate (i.e., at the rate at which data are captured by the capturing devices 215 c, and at the rate at which data to be rendered are consumed by the rendering devices 210 c).

In detail, every time a new chunk of audio data is available from the input interface of the microphone 215 c, an input mixing operation is performed by the mixers 420 and 425: the next available chunk of data present in the data contribution buffer 420-1 and 425-1, respectively, is taken, and it is mixed with the new audio data chunk just captured by the microphone 215 c.

Similarly, every time a new chunk of audio data is requested by the loudspeaker 210 c output interface, an output mixing operation is performed by the mixer 430, using the next available chunk of audio data present in the mixer contribution buffers 430-1 and 430-2.

This guarantees the minimum end-to-end delay between the availability of data (from the microphone 215 c, or from the rendering multimedia software module instances 410 a and 410 b) and the production of data (for the grabbing multimedia software module instances 405 a and 405 b and the loudspeaker 210 c).

The operation of the conference mixer 205 is explained in detail herein below, referring to the flowchart of FIG. 5. In the following explanation, it is assumed that the rendering devices 210 c and the capturing devices 215 c operate on a same time base, i.e. with a same clock, so that they (their I/O interfaces) generate simultaneous interrupts (i.e., they are synchronous); however, this is not to be construed as a limitation for the present invention, which applies as well in the case the rendering devices 210 c and the capturing devices 215 c operate with nominally equal but different clocks (the different clocks can drift), and also in case the two clocks are not even nominally equal (in these cases, the interrupts generated by the (input interfaces of the) rendering and capturing devices are not, as a rule, simultaneous).

When the render time of a new audio data chunk arrives (block 505, exit branch Y), the oldest data chunks present in the mixer contribution buffers 430-1 and 430-2 are taken (block 510), they are mixed together on the fly (block 515) and the mixed data are fed to the loudspeaker 210 c for rendering (block 520). The arrival of the render time of the new audio data chunk is an event 507 that may be signaled by a notification, like an interrupt from the (input interface of the) loudspeaker 210 c, if the conference mixer 205 is implemented as a kernel-space device driver, or said notification may be an asynchronous notification from the loudspeaker driver, in case the conference mixer 205 is implemented as a user-space process.

Under the above assumption that the rendering and capturing devices operate based on the same clock, the arrival of the render time of a new data chunk coincides with the arrival of the grabbing time of a new audio data chunk. In general, however, when the grabbing time of a new audio data chunk arrives (also in this case, this event can be an interrupt, if the conference mixer is implemented as a kernel-space device driver, or it can be an asynchronous notification from the microphone driver), the freshly captured audio data chunk is taken (block 525), the oldest data chunks present in the data contribution buffers 420-1 and, respectively, 425-1 are taken (block 530), and they are mixed, by the mixers 420 and 425, with the freshly captured data chunk (block 535), to produce new chunks of mixed audio to be sent to the peers; the mixed data are put in the data recipient buffers 420-2 and 425-2, respectively (block 540). The grabbing multimedia software module instances 405 a and 405 b then fetch the oldest mixed audio data chunks from the respective data recipient buffers 425-2 and 420-2.

The arrival of the render time of the new audio data chunk is also used to trigger the loading of the mixer contribution buffers 430-1 and 430-2 with a new audio frame (i.e., a part, a fragment of the audio data stream) made available by the rendering multimedia software module instances 410 a and 410 b (block 545), which access the conference mixer 205 through the rendering API it exposes; in this description, the term trigger means signaling to the multimedia rendering software module instances 410 a and 410 b that free space is available in the associated buffer 430-1 and 430-2. The multimedia rendering software module instances can then (when data are available) write a new audio frame into the buffers 430-1 and 430-2; the same audio frame is also copied into the respective data contribution buffer 420-1 or 425-1.
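The event-driven flow of FIG. 5 described above can be summarized by the following minimal sketch for the three-peer case. It is an illustration under stated assumptions, not the implementation of the embodiment: the class and method names are hypothetical, device notifications are modeled as plain method calls, mixing is assumed to be a clipped sample-wise sum, and silence is assumed to be substituted when a buffer underflows.

```python
from collections import deque


def mix(chunks):
    """Sum equal-length PCM chunks sample by sample, clipped to 16 bits (assumed)."""
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*chunks)]


class ConferenceMixerSketch:
    """Hypothetical model of the mixer hosted at peer C, with remote peers A and B."""

    def __init__(self, chunk_len):
        self.chunk_len = chunk_len
        # Mixer contribution buffers for rendering (430-1, 430-2).
        self.render_in = {"A": deque(), "B": deque()}
        # Data contribution buffers for grabbing (420-1, 425-1).
        self.capture_in = {"A": deque(), "B": deque()}
        # Data recipient buffers, keyed by destination peer (425-2 -> A, 420-2 -> B).
        self.to_peer = {"A": deque(), "B": deque()}

    def write_frame(self, peer, chunk):
        """Block 545: a rendering module instance writes a decoded frame,
        which is also copied into the corresponding data contribution buffer."""
        self.render_in[peer].append(chunk)
        self.capture_in[peer].append(chunk)

    def _oldest(self, buf):
        # Assumption: on underflow, silence is used in place of the missing chunk.
        return buf.popleft() if buf else [0] * self.chunk_len

    def on_render_ready(self):
        """Blocks 510-520: called on the loudspeaker notification."""
        return mix([self._oldest(self.render_in[p]) for p in ("A", "B")])

    def on_capture_ready(self, captured):
        """Blocks 525-540: called on the microphone notification."""
        for source, target in (("A", "B"), ("B", "A")):
            oldest = self._oldest(self.capture_in[source])
            self.to_peer[target].append(mix([oldest, captured]))
```

In this sketch no mixing work is done ahead of time: a mixed chunk is produced only when the device asks for it, which is the property the description relies on to avoid extra buffering.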

Concerning the size of the buffers 420-1 and 420-2, 425-1 and 425-2, and 430-1 and 430-2, a trade-off exists. In order to keep the end-to-end delay low, the buffers should be kept as small as possible. However, this conflicts with the requirement of having buffers as big as possible, in order to avoid introducing glitches in the audio streams, due to underflow in one of the data contribution buffers 420-1 and 425-1 (for example, this happens when the data contribution buffer 420-1 is empty when a new captured audio data chunk arrives from the microphone 215 c).

Concerning the size of the audio data chunks to be stored in the buffers, significant parameters for determining it are the number of bits per audio sample (for example, 16 bits), the sampling rate (for example, 8 kHz for narrow band compression algorithms like G.723, or 16 kHz for wide band compression algorithms like G.722.2), and the duration of the audio frame, i.e. the minimum amount of data handled by the chosen compression algorithm (for example, 10 ms for G.729, 30 ms for G.723, 20 ms for G.722). Thus, the data chunk size can be:

data chunk size = sample rate × (bits per sample / 8) × duration of audio frame.
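As a worked example of the above formula, the following sketch computes the chunk size in bytes. The function name is hypothetical, and a single (mono) channel with a whole number of bytes per sample is assumed.

```python
def chunk_size_bytes(sample_rate_hz, bits_per_sample, frame_ms):
    """Data chunk size in bytes for one mono audio frame.

    Assumes bits_per_sample is a multiple of 8 and that the frame
    duration, given in milliseconds, covers a whole number of samples.
    """
    bytes_per_second = sample_rate_hz * bits_per_sample // 8
    return bytes_per_second * frame_ms // 1000


narrow_band = chunk_size_bytes(8000, 16, 30)   # 8 kHz, 16 bits, 30 ms frame
wide_band = chunk_size_bytes(16000, 16, 20)    # 16 kHz, 16 bits, 20 ms frame
```

With these inputs, narrow_band is 480 bytes (8000 × 2 × 0.030) and wide_band is 640 bytes (16000 × 2 × 0.020).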

When the endpoints 110 a and 110 b of the peers A and B use different data compression algorithms, the size of the audio frames coming from the two peers (and made available by the rendering multimedia software module instances) may be different. This introduces additional cases of underflow during the mixing operation: if the amount of data available from all the different contributors of the mixing operation is not enough to produce a complete data chunk in output, a glitch is introduced.

According to an embodiment of the present invention, the conference mixer 205 may compute a single data chunk size, that is used for all the buffers, using the least common multiple of all the audio frame sizes used by the different conference peers. For example, assuming that peer A transmits (and receives) audio frames of 30 ms, and peer B transmits (and receives) audio frames of 20 ms, the size of the audio data chunk used by the conference mixer (i.e., the “quantum” of audio data processed by the conference mixer 205) may be 60 ms; this means that every 60 ms, an interrupt arrives from the loudspeaker 210 c, and a new data chunk of 60 ms is generated on the fly by the mixer 430 and sent to the loudspeaker 210 c (similarly, under the assumption that a single time base exists, every 60 ms a new data chunk of audio captured by the microphone 215 c is ready).
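The computation of the common chunk duration can be sketched as follows, reproducing the 30 ms / 20 ms example. The function name is hypothetical, and frame durations are assumed to be whole numbers of milliseconds.

```python
from functools import reduce
from math import gcd


def common_chunk_ms(frame_durations_ms):
    """Least common multiple of the peers' audio frame durations, in ms."""
    return reduce(lambda a, b: a * b // gcd(a, b), frame_durations_ms)


chunk_ms = common_chunk_ms([30, 20])   # peer A: 30 ms frames, peer B: 20 ms frames
frames_from_a = chunk_ms // 30         # frames of peer A contained in one chunk
frames_from_b = chunk_ms // 20         # frames of peer B contained in one chunk
```

Here chunk_ms is 60, so each chunk holds two frames from peer A and three frames from peer B, and every contributor always supplies a whole number of frames per chunk.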

The number of data chunks that the generic buffer of the conference mixer 205 is designed to store is tuned taking into consideration the interrupt latency time or the task switch latency time provided by the equipment/operating system hosting the multimedia modules. Under the assumption that the latency time is lower than the duration (i.e. the size) of a data chunk, the number of data chunks can be kept to the minimum (each buffer has two registers, each of the size of one data chunk, that are used in a “ping-pong” access mode, alternately for reading and writing). In other words, when the system is able to process a single data chunk within its duration, an approach wherein each buffer can store two data chunks is regarded as a preferred method for handling data in order to minimize the end-to-end delay: while the processing of a new data chunk is in progress, the system can feed the old data chunk to (from) the rendering (capturing) device. This concept is described better in the description relevant to FIG. 6.
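The “ping-pong” access mode can be illustrated with the following minimal sketch of a two-slot buffer. The class and method names are hypothetical; the swap is assumed to be performed at each device notification, and no locking is shown.

```python
class PingPongBuffer:
    """Two chunk-sized slots used alternately for writing and reading."""

    def __init__(self):
        self._slots = [None, None]
        self._write_index = 0

    def write(self, chunk):
        """Store the chunk currently being produced in the write slot."""
        self._slots[self._write_index] = chunk

    def swap(self):
        """At each device notification the two slots exchange roles."""
        self._write_index ^= 1

    def read(self):
        """Return the chunk completed during the previous period."""
        return self._slots[self._write_index ^ 1]


buf = PingPongBuffer()
buf.write("chunk n")
buf.swap()                  # notification: chunk n becomes readable
buf.write("chunk n+1")      # written while chunk n is being consumed
first_read = buf.read()
buf.swap()
second_read = buf.read()
```

In this usage, first_read is "chunk n" and second_read is "chunk n+1": each chunk becomes readable exactly one period after its writing started, which is the one-chunk delay discussed with reference to FIG. 6.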

The minimum end-to-end delay introduced by the conference mixer 205 is equal to the duration of one audio data chunk between the peers A and C and between the peers B and C (as described in FIG. 6). The additional end-to-end delay introduced by the mixer is equal to the time that a data chunk has to wait in the buffer 430-1 or 430-2 before rendering. In order to avoid underflow in one of the aforementioned buffers, a minimum waiting time of one data chunk time is needed. No additional end-to-end delay is instead introduced in the grabbing path from peer C to peer A or B. The buffers that introduce end-to-end delay are the buffers 420-1, 425-1, 430-1 and 430-2.

Referring to FIG. 6, there is schematically depicted the delay introduced by the mixer contribution buffers 430-1 and 430-2. The minimum level of buffering possible, without introducing artifacts during the mixing operation due to underflow in one of the mixer contribution buffers 430-1 and 430-2, is equal to one data chunk; thus, the end-to-end delay is equal to one data chunk. In the drawing, INT(n−1), INT(n), INT(n+1), INT(n+2), INT(n+3) denote five consecutive interrupts from the rendering device (the loudspeaker 210 c), occurring at instants T_(n−1), T_(n), T_(n+1), T_(n+2), T_(n+3). Interrupt INT(n−1) starts the rendering of the (n−1)-th audio data chunk, interrupt INT(n) starts the rendering of the n-th audio data chunk, and so on.

Assuming for the sake of simplicity that the data chunk size equals the audio frame size, the generic one of the rendering multimedia software module instances 410 a or 410 b starts writing (event Ws(n)) the n-th audio frame to the respective mixer contribution buffer 430-1 or 430-2 at instant T_(n−1), when the audio rendering procedure 435 receives the interrupt INT(n−1) for starting to play the (n−1)-th data chunk; the writing of the n-th audio frame to the mixer contribution buffer ends (event We(n)) before the arrival of the next interrupt INT(n), thus when this next interrupt arrives the new audio data chunk is ready to be played. When, at instant T_(n), the audio rendering procedure 435 receives the next interrupt INT(n) for starting to play the n-th data chunk, the rendering multimedia software module instance 410 a or 410 b starts writing (event Ws(n+1)) the (n+1)-th audio frame to the respective mixer contribution buffer 430-1 or 430-2; the writing of the (n+1)-th audio frame to the mixer contribution buffer ends (event We(n+1)) before the arrival of the next interrupt INT(n+1), so the new data chunk is ready to be played when that interrupt arrives, and so on.

Thanks to the fact that the mixing operation is performed on the fly in the audio rendering procedure 435, starting at the receipt of the interrupt (or of the asynchronous notification) from the rendering device, additional buffering and delays are avoided.
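The notification-driven mixing described above can be illustrated by the following minimal sketch. It is not the implementation of the conference mixer 205: the class name `RenderMixer`, the method names, the use of 16-bit PCM sample lists and the choice of substituting silence on underflow are all assumptions made for illustration only.

```python
from collections import deque


def mix_chunks(chunks):
    """Sum samples position by position, clipping to the 16-bit PCM range (assumed format)."""
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*chunks)]


class RenderMixer:
    """Hypothetical stand-in for the rendering-side mixer fed by the contribution buffers."""

    def __init__(self, peer_ids):
        # One contribution buffer per remote peer (playing the role of 430-1, 430-2).
        self.buffers = {peer: deque() for peer in peer_ids}

    def write(self, peer, chunk):
        """Called by a rendering module instance to queue one data chunk."""
        self.buffers[peer].append(chunk)

    def on_render_ready(self, chunk_len):
        """Called only when the rendering device signals it is ready for a new chunk.

        Mixing happens here, on the fly, so no mixed data is buffered in advance.
        An empty buffer contributes silence (an assumption of this sketch).
        """
        silence = [0] * chunk_len
        contributions = [
            buf.popleft() if buf else silence for buf in self.buffers.values()
        ]
        return mix_chunks(contributions)


mixer = RenderMixer(["A", "B"])
mixer.write("A", [1000, -2000, 30000])
mixer.write("B", [500, -500, 10000])
mixed = mixer.on_render_ready(3)  # one mixed chunk, produced at notification time
```

In this sketch the only data waiting at any time are the unmixed contributions, which mirrors the point made above that mixing at the receipt of the notification avoids an extra buffering stage for mixed data.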

As mentioned in the foregoing, a possibility when the audio frames coming from/to be sent to the different peers are different in size is to work with data chunks of size equal to the lowest common multiple of the sizes of the different audio frames. Referring to FIG. 7, in a time diagram similar to that of FIG. 6 there is depicted a case in which the two peers A and B adopt different audio frames; this translates into different periods T_(A), T_(B) at which the rendering multimedia software module instances 410 a and 410 b make the audio frames available. Suppose that, at the instant T_(n−1) at which the rendering multimedia software module instance 410 a starts making available the n-th frame of the audio stream coming from peer A, the rendering multimedia software module instance 410 b also starts making available the n-th frame of the audio stream coming from the peer B. At instant T_(n) the rendering multimedia software module instance 410 a completes the n-th audio frame, and starts making available the (n+1)-th audio frame; the rendering multimedia software module instance 410 b instead completes the n-th audio frame later, at instant T_(n+a). Similarly, the rendering multimedia software module instance 410 a completes the (n+1)-th audio frame at instant T_(n+1), whereas the rendering multimedia software module instance 410 b completes the (n+1)-th audio frame later, at instant T_(n+1+a). The rendering multimedia software module instance 410 a completes the (n+2)-th audio frame at instant T_(n+2), whereas the rendering multimedia software module instance 410 b completes the (n+2)-th audio frame later, at instant T_(n+3), when the rendering multimedia software module instance 410 a completes the (n+3)-th audio frame: at this instant, the two rendering multimedia software module instances 410 a and 410 b are again synchronized. In order to prevent underflow at the instant T_(n+a), the mixer contribution buffers 430-1 and 430-2 may be designed so as to contain data chunks of size equal to four audio frames from peer A, and three audio frames from peer B.
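The lowest-common-multiple sizing can be worked through with a short sketch. The frame durations used below (15 ms for peer A, 20 ms for peer B) are hypothetical values chosen only because they reproduce the four-to-three frame ratio of the FIG. 7 discussion; the function name `lcm_chunk` is likewise an assumption.

```python
from math import gcd


def lcm_chunk(frame_ms_by_peer):
    """Return (chunk duration, frames per chunk for each peer).

    The chunk duration is the lowest common multiple of the frame durations,
    so every peer contributes a whole number of frames to each chunk.
    """
    chunk = 1
    for ms in frame_ms_by_peer.values():
        chunk = chunk * ms // gcd(chunk, ms)
    frames = {peer: chunk // ms for peer, ms in frame_ms_by_peer.items()}
    return chunk, frames


# Hypothetical frame durations: 4 frames of A span the same time as 3 frames of B.
chunk_ms, frames_per_chunk = lcm_chunk({"A": 15, "B": 20})
```

With these assumed durations the chunk spans 60 ms, i.e. four frames from peer A and three from peer B, which is the price paid in buffering (and therefore delay) for this option.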

Preferably, in order to keep the end-to-end delay as low as possible, the size of the data chunks used by the conference mixer 205 may be equal to the maximum common divisor, instead of the lowest common multiple, of the sizes of the audio frames adopted by the different peers. For example, assuming again that the audio frames from/to peer A are of 30 ms, and those from/to the peer B are of 20 ms, the conference mixer 205 may be designed to work with data chunks of 10 ms; this means that every 10 ms, an interrupt arrives from the loudspeaker 210 c, and a new data chunk of 10 ms is generated (by mixing data chunks of 10 ms of the audio frames stored in the mixer contribution buffers 430-1 and 430-2) on-the-fly by the mixer 430 and sent to the loudspeaker 210 c; every two interrupts, the rendering multimedia software module instance 410 b is notified that in the buffer 430-2 there is space for a new audio frame, while this happens every three interrupts for the rendering multimedia software module instance 410 a.
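The arithmetic of this example can be sketched as follows, using the 30 ms and 20 ms frame durations given above. The function names `gcd_chunk_ms` and `refill_schedule` are hypothetical, and the schedule simply counts interrupts from the start of the session, which is an assumption of the sketch.

```python
from functools import reduce
from math import gcd


def gcd_chunk_ms(frame_ms_by_peer):
    """Chunk duration equal to the greatest common divisor of the frame durations."""
    return reduce(gcd, frame_ms_by_peer.values())


def refill_schedule(frame_ms_by_peer, interrupts):
    """For each rendering interrupt, list the peers whose buffer has room for a new frame.

    A peer with frames of F ms consumes one frame every F / chunk interrupts.
    """
    chunk = gcd_chunk_ms(frame_ms_by_peer)
    schedule = []
    for i in range(1, interrupts + 1):
        due = [
            peer
            for peer, ms in frame_ms_by_peer.items()
            if i % (ms // chunk) == 0
        ]
        schedule.append(due)
    return schedule


frames = {"A": 30, "B": 20}
chunk = gcd_chunk_ms(frames)            # 10 ms chunks
schedule = refill_schedule(frames, 6)   # notifications over six interrupts (60 ms)
```

Over six interrupts the instance serving peer B is notified at interrupts 2, 4 and 6, and the instance serving peer A at interrupts 3 and 6, matching the "every two" and "every three" interrupts stated above.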

This allows the mixing operation to be performed with an integer number of data chunk “quanta” on every buffer (on the contrary, should the data chunk size used in mixing be different for the different peers, more data would have to be buffered in order to prevent artifacts due to buffer underflow).

An advantage of implementing the conference mixer 205 in such a way that it exposes to the grabbing and rendering multimedia software module instances 405 a, 405 b, 410 a and 410 b the same APIs as conventional rendering/grabbing libraries or kernel device drivers is that it can be “transparently” inserted in an already working equipment architecture, without impacting the implementation of the rendering and grabbing multimedia modules.

In a preferred embodiment of the present invention, the usage of the conference mixer 205 is selectable by the user on a per-connection basis: during a communication involving just two peers, like for example peers A and C, or peers B and C, no mixing is needed, while the mixing becomes necessary during a multi-peer (three or more) session.

Restricting the usage of the conference mixer 205 to those cases wherein mixing is really necessary reduces side effects, such as an increase of the end-to-end delay during a conversation between just two peers.

In particular, the conference mixer 205 may be adapted to “auto-enable” when a third peer of a virtual conference starts producing and requesting audio data chunks.

In principle, as long as only two peers are engaged in a communication, the conference mixer could be left unused, thus avoiding introducing additional end-to-end delay, and when a third peer enters the communication session, the mixer could be inserted. However, this live, on-the-fly insertion of the conference mixer is a time-consuming operation that might cause long artifacts. In a preferred embodiment of the invention, in order to both avoid adding end-to-end delay when the mixing operation is not needed, and at the same time avoid the artifacts caused by the delay of insertion of the mixer when the third peer enters the conference, the conference mixer 205 is used from the beginning of the communication between the first two peers, and the buffering is reduced to the minimum required by the number of peers actively engaged in the conference. By “actively engaged” it is meant having “open” rendering/grabbing devices. The engagement of the mixer does not rely on the data flow (to or from its data interfaces) but on the explicit intention of a multimedia software module (grabbing or rendering) to start a new session; data flow in fact can be discontinuous as it is related to data availability for the remote peer. In particular, when only two peers, e.g. peers A and C, are active, no mixing operations are needed and the conference mixer 205 prevents buffering of audio data chunks in the data contribution buffer 420-1 and in the mixer contribution buffer 430-1. Data chunks are directly written by the rendering multimedia software module instance 410 a to the output device 210 c, without any buffering. In this way, no extra delay is introduced. In this condition, the thread 435, which when the mixer is enabled works synchronously with the interrupts received from the output rendering device 210 c and takes the data chunks from the mixer contribution buffers 430-1 and 430-2, does not perform any mixing when the mixer is disabled, but is only responsible for signaling to the rendering multimedia software module instance 410 a that the device is ready for rendering. The grabbing thread behaves similarly. When a new peer, like peer B, becomes active, the mixing operation and the relevant buffering operations are re-enabled.
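The session-based enabling described above might be sketched as follows. This is only an illustration under stated assumptions: the class `MixerControl`, its attributes and the use of plain lists for the contribution buffers and the output device are hypothetical, and the sketch covers only the rendering path.

```python
class MixerControl:
    """Hypothetical model of enabling the mixer by open sessions, not by data flow."""

    def __init__(self):
        self.open_peers = set()          # remote peers with an open rendering session
        self.contribution_buffers = {}   # per-peer buffers, used only while mixing
        self.device_queue = []           # stands in for the output device

    def open_session(self, peer):
        """A rendering module declares its intention to start a session."""
        self.open_peers.add(peer)
        self.contribution_buffers.setdefault(peer, [])

    def close_session(self, peer):
        self.open_peers.discard(peer)
        self.contribution_buffers.pop(peer, None)

    @property
    def mixing_enabled(self):
        # Mixing is needed only when at least two remote peers are actively engaged.
        return len(self.open_peers) >= 2

    def write(self, peer, chunk):
        """Same call in both modes, so the rendering module does not change behavior."""
        if peer not in self.open_peers:
            raise ValueError("no open session for peer " + str(peer))
        if self.mixing_enabled:
            self.contribution_buffers[peer].append(chunk)  # buffered for later mixing
        else:
            self.device_queue.append(chunk)                # bypass: no added delay


control = MixerControl()
control.open_session("A")
control.write("A", [1, 2])   # only peer A is open: the chunk bypasses the buffers
control.open_session("B")
control.write("A", [3, 4])   # two peers open: the chunk is buffered for mixing
```

The design point illustrated is that the switch depends on `open_session` and `close_session` calls rather than on whether data happens to be flowing, so a pause in a peer's stream does not toggle the mixer.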

The rendering and grabbing multimedia software modules do not need to change behavior when the conference mixer is enabled. This reduces the complexity of the software itself, especially in a crucial part such as the low-latency handling of audio.

The present invention has been here described considering some exemplary embodiments. Those skilled in the art will readily appreciate that several modifications to the described embodiments are possible, as well as different embodiments, without departing from the scope of protection defined in the appended claims.

In particular, although in the above description reference has been made to an audio conference, the present invention can also be applied to audio/video virtual conferences, by adding the buffers and mixers for the video component in an analogous manner as that described in the foregoing.

1-25. (canceled)
26. A telecommunications terminal hosting a conference mixer capable of being adapted to enable an at least audio conference between a first conference peer and at least two further conference peers, the conference mixer comprising: for each of the at least two further conference peers, a respective first data buffer capable of being configured to buffer portions of at least an audio data stream received from a respective conference peer; a first audio data stream portions mixer fed by the first data buffers and capable of being configured to: a) get audio data stream portions buffered in the first data buffers; b) mix the audio data stream portions from the first data buffers to produce a first mixed audio data portion; and c) feed the first mixed audio data portion to a rendering device of the telecommunications terminal, wherein said first audio data stream portions mixer is capable of being configured to perform operations a), b) and c) upon receipt of a notification from said rendering device indicating that the rendering device is ready to render a new mixed audio data portion.
27. The telecommunications terminal of claim 26, wherein said rendering device comprises a loudspeaker.
28. The telecommunications terminal of claim 26, wherein each of said first data buffers comprises at least two storage areas each one capable of being adapted to store at least one fragment of the audio data stream received from the respective peer.
29. The telecommunications terminal of claim 28, wherein each of said first data buffers is capable of being configured to be fed by a respective rendering function, said rendering function capable of being adapted to receive the audio data stream from the respective conference peer, and to provide to the respective first data buffer fragments of predetermined size of the audio data streams.
30. The telecommunications terminal of claim 29, wherein each of said storage areas has a size equal to a maximum common divisor of the sizes of the fragments of the audio data streams generated by the rendering functions of the at least two further peers.
31. The telecommunications terminal of claim 29, wherein each of said storage areas has a size equal to a lowest common multiple of the sizes of the fragments of the audio data streams generated by the rendering functions of the at least two further peers.
32. The telecommunications terminal of claim 26, further comprising: for each of the at least two further conference peers, a respective second data buffer capable of being configured to buffer the portions of at least an audio data stream received from the other conference peer; a second audio data stream portions mixer fed by the second data buffers and configured to: d) get audio data stream portions buffered in the second data buffers; e) get audio data stream portions from a capturing device of the telecommunications terminal; and f) mix the audio data stream portions from the second data buffers and the audio data stream portions from the capturing device to produce a respective second mixed audio data portion to be sent to a respective one of said at least two further conference peers, wherein said second audio data stream portions mixer is configured to perform operations d), e) and f) upon receipt of a notification from said capturing device indicating that the capturing device has captured a new audio data stream portion.
33. The telecommunications terminal of claim 32, further comprising: for each of the at least two further conference peers, a respective third data buffer capable of being configured to buffer the second mixed audio data portions produced by the respective second audio data stream portions mixer.
34. The telecommunications terminal of claim 33, wherein each of said third data buffers is capable of being configured to feed a respective grabbing function, said grabbing function capable of being adapted to transmit the audio data stream to the respective conference peer.
35. The telecommunications terminal of claim 29, further comprising a first application program interface for enabling access thereto by rendering functions.
36. The telecommunications terminal of claim 34, further comprising a second application program interface for enabling access thereto by said grabbing function.
37. The telecommunications terminal of claim 35, wherein said conference mixer is capable of being configured to be implemented as a user-space process running in a user space of the telecommunications terminal.
38. The telecommunications terminal of claim 35, wherein said conference mixer is capable of being configured to be implemented as a kernel-space process running in a kernel space of the telecommunications terminal.
39. The telecommunications terminal of claim 26, wherein said conference mixer is capable of being further configured to: detect the presence of said at least two further conference peers; automatically enable the first audio data stream portions mixer upon detection of said at least two further peers; and automatically disable said first audio data stream portions mixer in case the presence of just one further peer is detected.
40. The telecommunications terminal of claim 39, wherein said conference mixer is capable of being further configured to automatically enable the second audio data stream portions mixers upon detection of the presence of said at least two further peers, and to automatically disable the second audio data stream portions mixers in case the presence of just one further peer is detected.
41. A method of performing an at least audio conference between a first conference peer and at least two further conference peers, comprising: at a telecommunications terminal of the first conference peer, performing a first buffering of portions of at least an audio data stream received from each of the at least two further conference peers; and upon receipt of a notification from a rendering device of the telecommunications terminal, indicating that the rendering device is ready to render a new mixed audio data portion: mixing the audio data stream portions buffered in said first buffering, to produce the mixed audio data portion; and feeding the mixed audio data portion to the rendering device.
42. The method of claim 41, wherein said rendering device comprises a loudspeaker.
43. The method of claim 41, wherein said performing a first buffering comprises storing at least one portion of the audio data stream received from a respective peer.
44. The method of claim 43, comprising receiving said at least one portion of the audio data stream to be stored from a respective rendering function capable of being adapted to receive the audio data stream from the respective conference peer.
45. The method of claim 44, wherein said at least one portion has a size equal to a maximum common divisor of the sizes of audio data stream fragments generated by the rendering functions of the at least two further peers.
46. The method of claim 44, wherein said at least one portion has a size equal to a lowest common multiple of the sizes of audio data stream fragments generated by the rendering functions of the at least two further peers.
47. The method of claim 41, further comprising: performing a second buffering, for each of the at least two further conference peers, of the portions of at least an audio data stream received from the other conference peer; and upon receipt of a notification from a capturing device of the telecommunications terminal indicating that the capturing device has captured a new audio data stream portion: mixing the data stream portions buffered in said second buffering with an audio data stream portion captured by the capturing device to produce a respective second mixed audio data portion to be sent to a respective one of said at least two further conference peers.
48. The method of claim 47, further comprising: performing, for each of the at least two further conference peers, a third buffering of the second mixed audio data portions.
49. The method of claim 48, further comprising: feeding the buffered second mixed audio data portions to a respective grabbing function capable of being adapted to transmit the audio data stream to the respective conference peer.
50. The method of claim 41, further comprising: detecting the presence of said at least two further conference peers; automatically enabling said first buffering and mixing upon detection of said at least two further peers; and automatically disabling said first buffering and mixing in case the presence of just one further peer is detected.